1 In this vignette

In this data integration vignette, we will be taking the output from the Preprocessing vignette and merging our unique samples together into one unified Seurat object. We will also perform dimensional reduction on the integrated Seurat object. Herein, we will do the following:

  1. Normalize the RNA and ADT assays
  2. Identify integration features
  3. Run PCA on each sample independently
  4. Integrate data using rPCA
  5. Perform non-linear dimensional reduction
  6. Visualize unclustered dimensional reduction

There are multiple methods for data integration, some of which have been incorporated into Seurat V4. In this experiment, all four samples were prepared simultaneously and ultimately run on the 10X Chromium platform on the same chip. Therefore, we expect batch effects to be very small. For computational efficiency, we have integrated these data sets using the rPCA (reciprocal PCR) approach.

2 Prepare the R environment

2.1 Load in the necessary libraries

library(Seurat)
library(ggplot2)
library(dplyr)
library(cowplot)
library(SeuratWrappers)
library(plotly)
library(RColorBrewer)

2.2 Load in our Seurat object list

This .rds file comes from Vignette #1 - Data Preprocessing.

Seurat_list <- readRDS("../1_Preprocessing_Vignette/1_Preprocessing.rds")

3 Prepare Seurat objects for integration

3.1 Normalize RNA assay

First, we will normalize and find variable features for the RNA assay containing our gene expression data. The variable features will be used to determine integration features.

Seurat_list <- lapply(X = Seurat_list, FUN = function(x) {
    x <- NormalizeData(x, assay="RNA", normalization.method = "LogNormalize", verbose = FALSE)
    x <- FindVariableFeatures(x, assay="RNA", selection.method = "vst", nfeatures = 2000, verbose = FALSE)
})

3.2 Normalize ADT assay

Like for the HTO assay in the previous vignette, we will use CLR normalization for the ADT assay.

Seurat_list <- lapply(X = Seurat_list, FUN = function(x) {
    x <- NormalizeData(x, assay="ADT", normalization.method="CLR", verbose= FALSE)
    x <- ScaleData(x, assay= "ADT", verbose= FALSE)
})

3.3 Select integration features

The rPCA approach requires scaling of the data and subsequent PCA of each Seurat object prior to integration. Here, we are scaling on the integration features to produce our PCA results.

features <- SelectIntegrationFeatures(object.list = Seurat_list)
Seurat_list <- lapply(X = Seurat_list, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
})

4 Perform integration using rPCA

Data integration retains metaData for each individual Seurat object. The “project name”, or diet in this experiment, is stored in the orig.ident metaData column.

4.1 Determine integration anchors

options(future.globals.maxSize= 5242880000) #Integration requires a larger global size limit than default. I've set it to 5gb.
anchors <- FindIntegrationAnchors(object.list = Seurat_list, anchor.features = features, reduction = "rpca", verbose=FALSE)
Some cell names are duplicated across objects provided. Renaming to enforce unique cell names.

  |                                                  | 0 % ~calculating  
  |+++++++++                                         | 17% ~31s          
  |+++++++++++++++++                                 | 33% ~25s          
  |+++++++++++++++++++++++++                         | 50% ~21s          
  |++++++++++++++++++++++++++++++++++                | 67% ~13s          
  |++++++++++++++++++++++++++++++++++++++++++        | 83% ~07s          
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=42s  

4.2 Integrate objects and generated an integrated assay

A new assay, called integrated will be generated upon data integration. We will use this assay for clustering, but will always revert to the RNA assay for differential expression.

data.integrated <- IntegrateData(anchorset = anchors)
Merging dataset 2 into 3
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 4 into 3 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 1 into 3 2 4
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data

5 Dimensional Reduction

As mentioned before, the integrated data assay will be used for clustering and visualization.

5.1 Scale the data and perform PCA on the integrated assay

PCA is a linear dimensional reduction approach that seeks to identify the most variable sets of features. I’ve chosen to run 50 PCs for this experiment.

# Change default assay
DefaultAssay(data.integrated) <- "integrated"

# Scale the data and perform PCA on the integrated assay
data.integrated <- ScaleData(data.integrated, assay="integrated", verbose=FALSE)
data.integrated <- RunPCA(data.integrated, assay="integrated", npcs = 50, verbose=FALSE)

5.2 Identify how many dimensions to use

Multiple methods can be used to determine how many PCs to use for downstream analysis. Here, we use both a statistical approach JackStraw and a heuristic approach ElbowPlot as recommended in Seurat’s PBMC3k tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.

5.2.1 JackStraw

First, we will calculate and score via JackStraw. Warning: Jackstraw takes a long time to run.

5.2.2 ElbowPlot

The ElbowPlot simply ranks PCs by percentage of variance.

ElbowPlot(data.integrated, ndims=50)

5.3 Nonlinear Dimensional Reduction

Uniform Manifold Projection (UMAP) is the preferred visualization for large datasets. It runs more efficiently than tSNE and preserves global structure much better. However, both work well for visualization and we run both to store them as embeddings in case we need to look at both. We use all 50 PCAs, since all appear to be significant.

Note: We calculate here both 2-dimensional and 3-dimensional embeddings. 3-dimensional embeddings can be useful to capture interactions between cell types that are otherwise very different when only looking at two dimensions.

Warning: UMAP in particular seems extremely sensitive to changes in OS and to any updates in the underlying software. These changes seem to result in rotations or flips across axes, even when setting the same seed.

5.3.1 Calculate UMAP

data.integrated <- RunUMAP(data.integrated, dims=1:50, verbose=FALSE)
data.integrated <- RunUMAP(data.integrated, dims=1:50, n.components=3, reduction.name="umap3D", reduction.key="UMAP3D_")
17:27:43 UMAP embedding parameters a = 0.9922 b = 1.112
17:27:43 Read 33322 rows and found 50 numeric columns
17:27:43 Using Annoy for neighbor search, n_neighbors = 30
17:27:43 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:27:48 Writing NN index file to temp file /tmp/Rtmp9kJc24/file8fc364edc711
17:27:48 Searching Annoy index using 1 thread, search_k = 3000
17:27:59 Annoy recall = 100%
17:28:00 Commencing smooth kNN distance calibration using 1 thread
17:28:02 Initializing from normalized Laplacian + noise
17:28:04 Commencing optimization for 200 epochs, with 1503980 positive edges
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
17:28:25 Optimization finished

5.3.2 Calculate tSNE

data.integrated <- RunTSNE(data.integrated, dims=1:50, verbose=FALSE)
test2 <- test

6 Data Projection

6.1 2D visualization

d1 <- DimPlot(data.integrated, reduction="umap", label=F) + NoLegend() + labs(subtitle="UMAP")
d2 <- DimPlot(data.integrated, reduction="tsne", label=F) + NoLegend() + labs(subtitle="tSNE")
d1+d2

6.2 3D visualization

For 3-dimensional visualization, we use the plotly packaged to produce an interactive plot. ### UMAP

plot.data <- FetchData(object = data.integrated, vars = c("UMAP3D_1", "UMAP3D_2", "UMAP3D_3"))
plot_ly(data = plot.data, #3D interactive plot of the data
        x = ~UMAP3D_1, y = ~UMAP3D_2, z = ~UMAP3D_3, 
        type = "scatter3d", 
        mode = "markers", 
        marker = list(size = 0.5, width=0.5),
        hoverinfo="text")

6.2.1 tSNE

plot.data2 <- FetchData(object = data.integrated, vars = c("tSNE3D_1", "tSNE3D_2", "tSNE3D_3"))
plot_ly(data = plot.data2, #3D interactive plot of the data
        x = ~tSNE3D_1, y = ~tSNE3D_2, z = ~tSNE3D_3,
        type = "scatter3d", 
        mode = "markers", 
        marker = list(size = 0.5, width=0.5),
        hoverinfo="text")

7 Save Progress

saveRDS(data.integrated, file="2_DataIntegration.rds")

8 Session Information

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] msigdbr_7.4.1               VAM_0.5.3                   MASS_7.3-54                 org.Mm.eg.db_3.13.0         AnnotationDbi_1.54.0       
 [6] RColorBrewer_1.1-2          plotly_4.9.3                SeuratWrappers_0.3.0        velocyto.R_0.6              Matrix_1.3-4               
[11] celldex_1.2.0               SummarizedExperiment_1.22.0 Biobase_2.52.0              GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
[16] IRanges_2.26.0              S4Vectors_0.30.0            BiocGenerics_0.38.0         MatrixGenerics_1.4.0        matrixStats_0.59.0         
[21] formattable_0.2.1           viridis_0.6.1               viridisLite_0.4.0           cowplot_1.1.1               SingleR_1.0.1              
[26] clustree_0.4.3              ggraph_2.0.5                dplyr_1.0.6                 ggplot2_3.3.3               SeuratObject_4.0.1         
[31] Seurat_4.0.2.9004           SoupX_1.5.2                

loaded via a namespace (and not attached):
  [1] utf8_1.2.1                    reticulate_1.20               tidyselect_1.1.1              RSQLite_2.2.7                
  [5] htmlwidgets_1.5.3             grid_4.1.0                    BiocParallel_1.26.0           Rtsne_0.15                   
  [9] munsell_0.5.0                 ScaledMatrix_1.0.0            codetools_0.2-18              ica_1.0-2                    
 [13] future_1.21.0                 miniUI_0.1.1.1                withr_2.4.2                   colorspace_2.0-1             
 [17] filelock_1.0.2                knitr_1.33                    rstudioapi_0.13               SingleCellExperiment_1.14.1  
 [21] ROCR_1.0-11                   tensor_1.5                    pbmcapply_1.5.0               listenv_0.8.0                
 [25] labeling_0.4.2                GenomeInfoDbData_1.2.6        polyclip_1.10-0               bit64_4.0.5                  
 [29] farver_2.1.0                  pheatmap_1.0.12               rhdf5_2.36.0                  parallelly_1.25.0            
 [33] vctrs_0.3.8                   generics_0.1.0                xfun_0.23                     BiocFileCache_2.0.0          
 [37] doParallel_1.0.16             R6_2.5.0                      graphlayouts_0.7.1            rsvd_1.0.5                   
 [41] locfit_1.5-9.4                bitops_1.0-7                  rhdf5filters_1.4.0            spatstat.utils_2.1-0         
 [45] cachem_1.0.5                  DelayedArray_0.18.0           assertthat_0.2.1              promises_1.2.0.1             
 [49] scales_1.1.1                  gtable_0.3.0                  beachmat_2.8.0                globals_0.14.0               
 [53] goftest_1.2-2                 tidygraph_1.2.0               rlang_0.4.11                  splines_4.1.0                
 [57] lazyeval_0.2.2                spatstat.geom_2.1-0           BiocManager_1.30.15           yaml_2.2.1                   
 [61] reshape2_1.4.4                abind_1.4-5                   crosstalk_1.1.1               httpuv_1.6.1                 
 [65] rsconnect_0.8.18              tools_4.1.0                   ellipsis_0.3.2                spatstat.core_2.1-2          
 [69] jquerylib_0.1.4               ggridges_0.5.3                Rcpp_1.0.6                    plyr_1.8.6                   
 [73] sparseMatrixStats_1.4.0       zlibbioc_1.38.0               purrr_0.3.4                   RCurl_1.98-1.3               
 [77] rpart_4.1-15                  deldir_0.2-10                 pbapply_1.4-3                 zoo_1.8-9                    
 [81] ggrepel_0.9.1                 cluster_2.1.2                 tinytex_0.32                  magrittr_2.0.1               
 [85] RSpectra_0.16-0               data.table_1.14.0             scattermore_0.7               lmtest_0.9-37                
 [89] RANN_2.6.1                    pcaMethods_1.84.0             fitdistrplus_1.1-5            patchwork_1.1.1              
 [93] mime_0.10                     evaluate_0.14                 GSVA_1.40.1                   xtable_1.8-4                 
 [97] XML_3.99-0.6                  gridExtra_2.3                 compiler_4.1.0                tibble_3.1.2                 
[101] KernSmooth_2.23-20            crayon_1.4.1                  htmltools_0.5.1.1             mgcv_1.8-36                  
[105] later_1.2.0                   tidyr_1.1.3                   DBI_1.1.1                     ExperimentHub_2.0.0          
[109] tweenr_1.0.2                  dbplyr_2.1.1                  rappdirs_0.3.3                babelgene_21.4               
[113] igraph_1.2.6                  pkgconfig_2.0.3               spatstat.sparse_2.0-0         foreach_1.5.1                
[117] annotate_1.70.0               bslib_0.2.5.1                 XVector_0.32.0                doFuture_0.12.0              
[121] stringr_1.4.0                 digest_0.6.27                 sctransform_0.3.2             RcppAnnoy_0.0.18             
[125] graph_1.70.0                  spatstat.data_2.1-0           Biostrings_2.60.0             rmarkdown_2.8                
[129] leiden_0.3.8                  edgeR_3.34.0                  uwot_0.1.10                   DelayedMatrixStats_1.14.0    
[133] GSEABase_1.54.0               curl_4.3.1                    shiny_1.6.0                   outliers_0.14                
[137] lifecycle_1.0.0               nlme_3.1-152                  jsonlite_1.7.2                Rhdf5lib_1.14.2              
[141] limma_3.48.0                  fansi_0.5.0                   pillar_1.6.1                  lattice_0.20-44              
[145] KEGGREST_1.32.0               fastmap_1.1.0                 httr_1.4.2                    survival_3.2-11              
[149] remotes_2.4.0                 interactiveDisplayBase_1.30.0 glue_1.4.2                    iterators_1.0.13             
[153] png_0.1-7                     BiocVersion_3.13.1            bit_4.0.4                     ggforce_0.3.3                
[157] stringi_1.6.2                 sass_0.4.0                    HDF5Array_1.20.0              blob_1.2.1                   
[161] singscore_1.12.0              AnnotationHub_3.0.1           BiocSingular_1.8.0            memoise_2.0.0                
[165] irlba_2.3.3                   future.apply_1.7.0           
---
title: "2_DataIntegration_Vignette"
author: "Matthew A. Cottam"
date: "8/8/2021"
output:
  html_notebook: 
    theme: united
    toc_float: yes
    toc: yes
    fig_height: 6
    number_sections: yes
  html_document: 
    toc: yes
    number_sections: yes
    keep_md: true
---
# In this vignette
In this data integration vignette, we will be taking the output from the Preprocessing vignette and merging our unique samples together into one unified Seurat object. We will also perform dimensional reduction on the integrated Seurat object. Herein, we will do the following:

1. Normalize the RNA and ADT assays
2. Identify integration features
3. Run PCA on each sample independently
4. Integrate data using rPCA
5. Perform non-linear dimensional reduction
6. Visualize unclustered dimensional reduction

There are multiple methods for data integration, some of which have been incorporated into Seurat V4. In this experiment, all four samples were prepared simultaneously and ultimately run on the 10X Chromium platform on the same chip. Therefore, we expect batch effects to be very small. For computational efficiency, we have integrated these data sets using the `rPCA` (reciprocal PCR) approach.

# Prepare the R environment
## Load in the necessary libraries
```{r setup, message=FALSE, warning=FALSE}
library(Seurat)
library(ggplot2)
library(dplyr)
library(cowplot)
library(SeuratWrappers)
library(plotly)
library(RColorBrewer)
```

## Load in our Seurat object list
This `.rds` file comes from Vignette #1 - Data Preprocessing.
```{r readdata}
Seurat_list <- readRDS("../1_Preprocessing_Vignette/1_Preprocessing.rds")
```

# Prepare Seurat objects for integration
## Normalize RNA assay
First, we will normalize and find variable features for the RNA assay containing our gene expression data. The variable features will be used to determine integration features.

```{r normRNA}
Seurat_list <- lapply(X = Seurat_list, FUN = function(x) {
    x <- NormalizeData(x, assay="RNA", normalization.method = "LogNormalize", verbose = FALSE)
    x <- FindVariableFeatures(x, assay="RNA", selection.method = "vst", nfeatures = 2000, verbose = FALSE)
})
```

## Normalize ADT assay
Like for the `HTO` assay in the previous vignette, we will use CLR normalization for the `ADT` assay.
```{r normADT}
Seurat_list <- lapply(X = Seurat_list, FUN = function(x) {
    x <- NormalizeData(x, assay="ADT", normalization.method="CLR", verbose= FALSE)
    x <- ScaleData(x, assay= "ADT", verbose= FALSE)
})
```

## Select integration features
The rPCA approach requires scaling of the data and subsequent PCA of each Seurat object prior to integration. Here, we are scaling on the integration features to produce our PCA results.
```{r intfeats}
features <- SelectIntegrationFeatures(object.list = Seurat_list)
Seurat_list <- lapply(X = Seurat_list, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
})
```

# Perform integration using rPCA
Data integration retains metaData for each individual Seurat object. The “project name”, or diet in this experiment, is stored in the `orig.ident` metaData column.

## Determine integration anchors
```{r anchors}
options(future.globals.maxSize= 5242880000) #Integration requires a larger global size limit than default. I've set it to 5 Gb, but ~2 Gb was required.
anchors <- FindIntegrationAnchors(object.list = Seurat_list, anchor.features = features, reduction = "rpca", verbose=FALSE)
```

## Integrate objects and generated an integrated assay
A new assay, called `integrated` will be generated upon data integration. We will use this assay for clustering, but will always revert to the `RNA` assay for differential expression.
```{r integration}
data.integrated <- IntegrateData(anchorset = anchors)
```

# Dimensional Reduction
As mentioned before, the `integrated` data assay will be used for clustering and visualization.

## Scale the data and perform PCA on the integrated assay
PCA is a linear dimensional reduction approach that seeks to identify the most variable sets of features. I've chosen to run 50 PCs for this experiment.
```{r scaleint}
# Change default assay
DefaultAssay(data.integrated) <- "integrated"

# Scale the data and perform PCA on the integrated assay
data.integrated <- ScaleData(data.integrated, assay="integrated", verbose=FALSE)
data.integrated <- RunPCA(data.integrated, assay="integrated", npcs = 50, verbose=FALSE)
```

## Identify how many dimensions to use
Multiple methods can be used to determine how many PCs to use for downstream analysis. Here, we use both a statistical approach `JackStraw` and a heuristic approach `ElbowPlot` as recommended in Seurat’s PBMC3k tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html.

### JackStraw
First, we will calculate and score via JackStraw. Warning: Jackstraw takes a long time to run.
```{r jackstraw, fig.width=10, fig.height=7}
# Calculate the JackStraw visualization
data.integrated <- JackStraw(data.integrated, assay= "integrated", dims= 50, num.replicate= 100, verbose= FALSE)

# Score JackStraw 
data.integrated <- ScoreJackStraw(data.integrated, dims = 1:50)

# Plot JackStraw
JackStrawPlot(data.integrated, dims=1:50) + theme(legend.position="bottom")
```

### ElbowPlot
The `ElbowPlot` simply ranks PCs by percentage of variance.
```{r elbow, fig.width=6, fig.height=5}
ElbowPlot(data.integrated, ndims=50)
```

## Nonlinear Dimensional Reduction
Uniform Manifold Projection (UMAP) is the preferred visualization for large datasets. It runs more efficiently than tSNE and preserves global structure much better. However, both work well for visualization and we run both to store them as embeddings in case we need to look at both. We use all 50 PCAs, since all appear to be significant. 

Note: We calculate here both 2-dimensional and 3-dimensional embeddings. 3-dimensional embeddings can be useful to capture interactions between cell types that are otherwise very different when only looking at two dimensions.

Warning: UMAP in particular seems extremely sensitive to changes in OS and to any updates in the underlying software. These changes seem to result in rotations or flips across axes, even when setting the same seed.

### Calculate UMAP
```{r umap}
data.integrated <- RunUMAP(data.integrated, dims=1:50, verbose=FALSE)
data.integrated <- RunUMAP(data.integrated, dims=1:50, n.components=3, reduction.name="umap3D", reduction.key="UMAP3D_", verbose=FALSE)
```

### Calculate tSNE
```{r tsne}
data.integrated <- RunTSNE(data.integrated, dims=1:50, verbose=FALSE)
data.integrated <- RunTSNE(data.integrated, dims=1:50, dim.embed=3, reduction.name="tsne3D", reduction.key="tSNE3D_")
```

# Data Projection

## 2D visualization
```{r 2dumap, fig.width=8, fig.height=4}
d1 <- DimPlot(data.integrated, reduction="umap", label=F) + NoLegend() + labs(subtitle="UMAP")
d2 <- DimPlot(data.integrated, reduction="tsne", label=F) + NoLegend() + labs(subtitle="tSNE")
d1+d2
```


## 3D visualization
For 3-dimensional visualization, we use the `plotly` packaged to produce an interactive plot.
### UMAP
```{r 3dumap}
plot.data <- FetchData(object = data.integrated, vars = c("UMAP3D_1", "UMAP3D_2", "UMAP3D_3"))
plot_ly(data = plot.data, #3D interactive plot of the data
        x = ~UMAP3D_1, y = ~UMAP3D_2, z = ~UMAP3D_3, 
        type = "scatter3d", 
        mode = "markers", 
        marker = list(size = 0.5, width=0.5),
        hoverinfo="text")
```

### tSNE
```{r 3dtsne}
plot.data2 <- FetchData(object = data.integrated, vars = c("tSNE3D_1", "tSNE3D_2", "tSNE3D_3"))
plot_ly(data = plot.data2, #3D interactive plot of the data
        x = ~tSNE3D_1, y = ~tSNE3D_2, z = ~tSNE3D_3,
        type = "scatter3d", 
        mode = "markers", 
        marker = list(size = 0.5, width=0.5),
        hoverinfo="text")
```

# Save Progress
```{r save}
saveRDS(data.integrated, file="2_DataIntegration.rds")
```

# Session Information
```{r session}
sessionInfo()
```

